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ABSTRACT 

We present a neural net algorithm for parameter estimation in the context of large 
cosmological data sets. Cosmological data sets present a particular challenge to pattern- 
recognition algorithms since the input patterns (galaxy redshift surveys, maps of cosmic 
microwave background anisotropy) are not fixed templates overlaid with random noise, 
but rather are random realizations whose information content lies in the correlations 
between data points. We train a “committee” of neural nets to distinguish between 
Monte Carlo simulations at fixed parameter values. Sampling the trained networks 
using additional Monte Carlo simulations generated at intermediate parameter values 
allows accurate interpolation to parameter values for which the networks were never 
trained. The Monte Carlo samples automatically provide the probability distributions 
and truth tables required for either a frequentist or Bayseian analysis of the one observ¬ 
able sky. We demonstrate that neural networks provide unbiased parameter estimation 
with comparable precision as maximum-likelihood algorithms but significant computa¬ 
tional savings. In the context of CMB anisotropies, the computational cost for param¬ 
eter estimation via neural networks scales as The results are insensitive to the 

noise levels and sampling schemes typical of large cosmological data sets and provide a 
desirable tool for the new generation of large, complex data sets. 
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1. Introduction 

A fundamental question in cosmology is the origin and evolution of large scale structure in 
the universe. The standard model for this evolution is the gravitational growth and collapse of 
initially small perturbations in the primordial density distribution. This picture is supported by 
the detection of primordial Cosmic Microwave Background (CMB) temperature anisotropies at a 
level of approximately one part in 10^ by the Cosmic Background Explorer (COBE) satellite and 
a series of ground-based and balloon-borne experiments. Small perturbations on the matter and 
energy density in the early universe are reflected in the temperature distribution of the CMB, 
providing a “snapshot” of conditions the early universe while the perturbations were still in the 
linear regime. 

One angular scale of particular interest is the horizon size at the surface of last scattering, the 
epoch when the universe cooled sufficiently to form neutral hydrogen and allow the CMB photons 
to propagate freely. Causally-connected regions at the surface of last scattering, as viewed from 
the present epoch, subtend an angle 

e ~ 1.7° 

where zis is the redshift at last scattering and flo is the total density of the universe relative to the 
critical (closure) density. Anisotropy on scales larger than ~ 2° reflect perturbations larger than 
the particle horizon and thus probe the primordial density distribution. On scales smaller than 2°, 
causal mechanisms become important and modify the primordial density in model-specific ways. 

From the one observable sky, we want to infer the values of these cosmological parameters with 
minimal uncertainty in the shortest possible time. Theoretical models do not predict a specific 
template for the CMB anisotropy (a hot spot at this location, a cold spot over there), but rather 
predict a statistical distribution usually expressed in terms of the angular power spectrum. Deriving 
the power spectrum from the data (or more generally, deriving model parameters directly from the 
sky maps) involves accounting for angular correlations between pixels, precluding use of simple 
linear least-squares techniques. The accepted standard in the CMB community has been the 
generalization of least-squares techniques as implemented in maximum likelihood algorithms (see, 
e.g., Gorski et ah 1996; Tegmark, Taylor, & Heavens 1997; Bond, Jaffe, & Knox 1998; Borrill 1999). 

The simplest method of parameter estimation uses a goodness-of-ht test to compare a set of 
observables yi ± ai measured at a set of positions Xi to a theoretical model Tj. If we have data 
points yi and Np parameters pj, we define 



where 

Np 

ri(a:) = '^pjXj{xi) 

i=i 


( 2 ) 
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is function of the parameters p and some fixed basis functions X(x) 
parameter values by minimizing with respect to the parameters, 

dx^ 


dp. 


= 0 


We obtain the “best-fit” 


(3) 


for the parameter pj. The least-squares system in Eq. 3 has the solution 


Np 



Pj — (A )jkBj 

(4) 


k=l 


where 

Aj(xi)Xfc(xj) 

^jk = 2^ 2 

(5) 


i=l ^ 


is an Np x Np matrix, and 

yiXk{xi) 

= ^2 
i=\ ^ 

(6) 

is a vector of length Np. 




If the data points are not independent, this relatively simple calculation becomes much more 
costly. Covariance between the observed data points can result from instrumental artifacts (cor¬ 
related noise, instrumental resolution, oversampled data) or from correlations in the underlying 
signal (for instance, measuring in real space a signal whose components are independent in Fourier 
space). Equation 1 can be generalized to include the effects of covariance, 

Nd Nd 

= E - r.)(M-i),,(y, - r,), (7) 

i=l j=l 

where 

M,, = {{y,- (E,)) {y, - (E,)) ) (8) 

is the Nd X covariance matrix and the brackets denote an ensemble average. 

Conjugate gradient techniques can solve for without expliciting solving for and thus 
avoiding the 0{N^) operations this would incur. But if the covariance matrix M depends on 
the parameters pj, then minimizing x^ will produce biased estimates for pj. Maximum-likelihood 
parameter estimation provide a tool to overcome this limitation. For a multivariate Gaussian 
distribution, the probability of obtaining the observed data yi given a set of parameters pj is 

C = P{y\p) = (27r)-^''/2 (9) 

where is defined in Eq. 7. The “best” choice of parameters is that which maximizes the likelihood 
function C. The curvature of the likelihood surface about the maximum defines the uncertainty in 
the fitted parameters. 


% > V'(F-i),-, 


( 10 ) 
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where 

Fij = { ) 

OpiOpj 

is the Fisher information matrix (Kendall & Stuart 1969) and L = — log(/l) (see Bunn &: Sugiyama 
1995; Vogeley &: Szalay 1996; Tegmark et al. 1997; Bond et al. 1998). 

The maximum likelihood estimator is unbiased and asymptotically approaches the equality 
in Eqn. 10. However, these advantages come at a steep price: both the the determinant 

calculation in Eq. 9 scale as 0{N^), making brute-force techniques computationally infeasible. For 
Nd > 10® the time required is measured in years, even on the most powerful supercomputers. A 
number of authors have suggested ways around this limitation, see e.g. Oh Spergel & Hinshaw 
1999, Hivon, et. al. 2001 and Dore, Knox & Peel 2001. Such methods usually rely on symmetries 
in the data, i..e, axial symmetry of the noise, while neural networks need to make no assumptions 
about symmetries. 



Neural networks provide an alternative for astrophysical parameter estimation.They have been 
used previously in astronomy for galaxy classihcation (Lahav et. al. 1996; Andreon et. al. 2000) 
and periodicity analysis of unevenly sampled data as applied to stellar light curves (Tagliaferri et. al. 1999). 
They have also been used to analyze stellar spectra (Bailer-Jones et. al. 1997; Bailer-Jones et. al. 1998; 
Bailer-Jones 2000), with results comparable to traditional methods. However, Bailer-Jones et. al. 
compared data to a deterministic model (stellar spectra), whereas cosmological applications exam¬ 
ine random patterns drawn from parameterized stochastic models. We demonstrate the generality 
of our algorithm by considering different problems with the same network architecture. We find the 
computational cost for training the network, in the context of CMB anisotropy, requires 
operations and thus provides a substantial improvement over brute-force maximum-likelihood meth¬ 
ods. 


The stochastic nature of our models is fundamental; their starting point is the quantum nature 
of the early universe. What the models predict are the parent populations from which individual 
realizations are to be drawn. Our single observed universe is assumed to be such a realization. The 
question becomes: given a particular model, which parameters that define the parent population 
best describe the observed data? This is to be contrasted with the problems of steller spectra 
or galaxy classification, which have at their heart either a deterministic model or a pre-assigned 
catalogue of types. 

This stochastic nature means any analysis of the observed data must rely on some statis¬ 
tical test. Leaving aside for the moment the computational challenges of traditional maximum 
likelihood methods, this raises the problem of knowing the best statistic for distinguishing be¬ 
tween different parameters or models (we can view model selection as a discrete parameter to be 
estimated). Maximum likelihood methods require an a priori definition of a goodness-of-fit func¬ 
tion. The choice of a goodness-of-fit function is not always obvious and is particularly acute for 
2D and 3D surveys. Much of the information lies in the phase features of these surveys. Sta¬ 
tistical tests can fail badly in detecting phase features, as witness the large literature devoted to 
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the relatively simple problem of edge detection in 2D data sets (see, e.g., Hough 1962 and Davis 
1975). Topological tests such as the genus or other Minkowski functionals have been applied to 
2- and 3-D maps, (Gott et. al. 1990; Kogut et. al. 1996) but the relative power of these statistics 
is poor (Phillips & Kogut 2001). Neural networks, in contrast, do not require specification of a 
single statistic of a priori interest. As the network is trained, it determines how it will discriminate 
between competing models. 

All observations add instrumental effects to the desired physical signal. Any analysis of the 
data must model these effects. For neural networks, this present no great challenge as long as we 
can model the effects. For either irregularly sampled time series data or 2D/3D survey data with 
missing patches, any analysis based on Fourier methods will suffer from alaising of power. With 
neural networks, this problem does not arise; by excluding from the simulations the data points 
missing in the observed data, the networks will learn the effects of the gaps. The inclusion of noise 
to the simulations may increase the number of training passes needed, but will not prevent the use 
of neural networks for parameter estimation. 


2. Neural Network 

Neural networks can be trained to estimate the value of a continuous parameter, and can re¬ 
liably interpolate to parameter values intermediate between the training values. Though this idea is 
not new to astrophysics, e.g. (Bailer-Jones et. al. 1997; Bailer-Jones et. al. 1998; Bailer-Jones 2000), 
our method is fundamentally different from that of Bailer-Jones et. al. in that our patterns are 
random samples drawn from a parameterized parent population. The Bailer-Jones et. al. method 
is a matter of template matching; randomness only enters their input patterns as instrument noise. 
For CMB and other cosmological data, the patterns themselves are intrinsically random. Nonethe¬ 
less, using the same basic neural network architecture, we can train the networks to discriminate 
stochastic patterns that differ according to the parent population from which they are drawn. 

We start by training a network to differentiate between simulated data sets (including in¬ 
strument noise and other artifacts) generated at a pair of discrete parameter values. The back- 
propagation adjusts the weights until the network outputs target value 0 when presented with the 
first set of patterns, and target value 1 when presented with the second. In practice, since the infor¬ 
mation distinguishing different parameter values is in the correlations, not the actual pixel values, 
any single network will not train to a sufficient degree. We improve the situation by training a 
small committee of networks and polling them to get a consensus opinion. Now we find simulations 
generated at the training parameter values produce two well defined peaks. Simulations generated 
with an intermediate parameter value (never present in training data) yield outputs peaking some¬ 
where in between, depending on whether the new parameter value is closer to the first or second 
training value (see Fig 2). 

We quantify this behavior by presenting the trained network with a set of new inputs drawn 
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from a grid of intermediate parameter values, and derive for each intermediate parameter value the 
corresponding probability distribution of output values. When the networks are later presented 
with an unknown pattern, each distribution gives the probability that the unknown pattern was 
generated with a parameter value corresponding to that grid point. The interpolated parameter is 
the probability-weighted mean. The grid samples all use the same trained networks; the sampling 
of the networks at different grid points is faster than the training since we usually need many 
more training sets than sampling sets, thus no great computational cost is incurred. This sort 
of sampling of trained networks becomes a key component in utilizing a Bayesian approach to 
parameter estimation, see Christensen and Meyer 2000 and Rocha et. al. 2000 for discussions of 
the Bayesian approach in the context CMB anisotropies and MacKay 1995, along with Bishop 1995 
for neural networks. Although we focus below on estimating a single parameter, the method is 
readily extended to multi-parameter fits. 


2.1. Parameter estimation via the Bayesian approach 

We use the standard MLP: where y\. is the outout of neuron k for 

layer /, Ni is the number of neurons in layer I and is the weight connecting neuron i of layer 
I — 1 with neuron k of layer 1. We include a bias for each neuron via Uq = 1. The activation 
function is the sigmoid f{x) = 1/(1 -|- e~^). We only consider networks with the fully-connected, 
single hidden layer topology and a single output neuron. The number of weights per network is 
A^wgt = Ahi(j(jen (Ainput T 2) -|- 1. 

We use standard backpropagation for training, with weight updates after each training point. 
The quadratic cost function is used: E = ^(opatt — ipatt)^, where Opatt = layer-g 

network output when presented with the input Xpatt and fpatt is the desired target for that pattern. 

The input patterns X are the end product of our simulations and as such are realizations taken 
from an underlying parameterized probability distribution with parameter p. Thus we label each 
input pattern by the parameter from which it was drawn: X(p). Assuming the data lies between 
two extreme values and p^^^ we train a network to the target = 0 for realizations 
and = 1 for X(p(^)) by repeatedly presenting the network with new samples at p^^^ and p^^'^ 
and back propagating the error. Once trained, the network will give an output o (X) between 0 
and 1 for any input parameter value. If p < p^^\ the output clips at 0, while if p > p^^^ the output 
clips at 1. 

Once we have trained a network, we can present additional, statistically independent samples 
drawn at p^®^ and p^^^. Figure 2a shows the output distributions for 1000 patterns of each parameter. 
We see the network has successfully trained in that the p*-®^ distribution is peaked at o = 0 while 
the p^^^ samples at o = 1. To be able to interpolate to intermediate values, we will need to present 
samples drawn at intermediate parameter values. Fig 2b shows the output distributions for an 
additional 1000 samples each for two parameters, one just a little larger than p^^^ and one a little 
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smaller than These distribution also show the same tendency to peak close to the limits of 
0 and 1, but not as strongly as those drawn at the trained parameters. In effect, the network 
is chosing which of the training parameters these new patterns, for which it was never trained, 
most closely resemble. At it stands now, this tendency makes it hard to construct the probability 
distributions we need for parameter estimation. 


By using a committee of networks, we take advantage of this peaking tendency. We want to 
determine the committee consensus and from this get the distributions we seek. The first step is 
converting the continuous output value into a discrete truth values 0 or 1. For each trained network, 
we associate a midpoint value Omid and for any input pattern X, we define its truth value according 
to 


0; O (X) < Ojuid 
1; O (X) > Ojuid 


( 12 ) 


We interpret t(X) = 0 as indicating the pattern was drawn from the parent population with 
parameter and similarly we associate t = 1 with p^^\ 


To determine Omid) we present N samples drawn at p^^^ and N samples at p^^\ For each 
of these sets and any Omid; we obtain the truth values t® and i = With this, 

~ number patterns drawn at correctly identified as drawn at and 

similarly at p^^\ We chose Omid to maximize fc = and refer to fc as 

the fraction correct, our main measure of how well a network has trained. In Fig 2a, Omid is marked 
with the vertical line and we find fc = 94%. 


We note fc allows us to determine the optimal number of training passes, Ativa in . Starting 
with an initially randomized network, as the network trains, we intermittently pause the training 
and sample the network to determine fc- It steadily increases to a maximum value and then levels 
out; the minimum for the training error E has been reached. We take Arrain to be just where this 
plateau starts and thus avoid overtraining. 


For each network, any given input pattern is converted into discrete truth values t. We now 
form a committee of such networks, where the only difference between the networks is the initial 
randomization of the weights. We find committee sizes Anet ~ 50 sufficient. After presenting any 
given pattern X to the committee, we have the collection of truth values t(X)m, m = 1,..., Anet- 
We view each as the vote from network m as to whether the pattern resembled those drawn at 
p(o) Q];. p(i) _ committee consensus is formed by generating the average truth value 


t(X) 


1 

Anet 


Nnet 

E 

m=l 


(13) 


Figure 2c shows the distribution of average truth values for the same set of samples drawn at p^^^ 
and p^^'i used in Fig 2a. They are now even more sharply peaked about t = 0 and t = 1 and 
in terms of the average truth value, fc = 100%. More important are the distributions displayed 
in Fig 2d. These are for the same intermediate samples as used in Fig 2b; we have now two well 



defined distributions with peaks intermediate to the peaks for the and samples. This is 
the general trend when we work in terms of the average truth value: as the parameter p is swept 
from p^^^ to p^^\ we get well defined distributions whose peak moves from t ~ 0 to t ~ 1. 


When we present our committee with an observed pattern, Xobsj for which we do not know 
the parameter, we get its average truth value tobs- From the Bayesian viewpoint, we are interested 
in the posterior distribution 'P(pltobs): given the observation, what is the probability the true 
parameter is p. With this, it is straightforward to determine the mean estimated parameter for 
this pattern, along with our confidence for this estimate. We use Bayes Theorem to express this in 
terms of the prior distributions, 


7^(p|tobs) = ^(tobslp)^^, (14) 

all of which are readily determined by sampling our committee. 

Selecting K parameter values uniformly distributed between the training values, p^ = p ^^^, p^^'^ + 
Ap, p^^^ + 2Ap, ..., p ^^^, we generate N samples at each of these parameter values. The sam¬ 
ples are presented to the committee of networks and thus for each sample X* drawn at each 
parameter value p^, its average truth value is computed. The t’s will take on discrete values, 
t = 0, 1/Xnet, 2/A^net, • • •, Ij and we let be the number of samples drawn at p^ with = j/Naet- 
Since we have the same number of patterns in each of the K sample sets, the prior for the param¬ 
eter is essentially uniform: V{p) = S{p — p^)/K. The probability of getting any of the possible 
discrete values of t is proportional to the total number of times it occurs for all the samples: 
'P(tj) = rij. And finally, given the true input parameter value is p^, the probability of the 

committee generating the average truth value ij is given by = n^/N. Taken together, we 

have the posterior probability distribution 


V{p\tj) 


Ek^ip-p^hj 

Ek' nf 


For our observed pattern, we have the parameter estimate 


Pohs — 




(15) 


(16) 


i.e., the parameter-weighted mean. We also get the 68% confidence width for our estimate without 
any extra work: 

Efc<bs 

Having the probability distribution Eqn 15 allows us to determine explicit upper and lower confi¬ 
dence intervals, along with the standard error of our parameter estimate. 


So far we have outlined how to make a single pass through our parameter estimation method. 
We must also determine when we have converged on the best estimate. There are three ways the 
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results generated by the above algorithm can be sub-optimal: i) the true parameter is is either too 
close to either end of the training range or outside it; ii) the training range is too broad; 

or hi) the training range is too narrow. The distributions used to determine Eqn 15 also allow us 
to test for each of these cases. If the hrst case did occur, then we have — Pobs| < o'(Pobs) for 
either i = 0 or i = 1. Assuming this is not the case, if cT(pobs) is much smaller than —p^^\ then 
we have too broad a training range. We should chose a new training range centered around the 
current estimate pobs and a width comparable to (T(pobs)- A too narrow training range is indicated 
by c(Pobs) ~ and a new broader range needs to be used. 

We find it beneficial to supplement the above tests by presenting the committee of trained 
networks with an independent set of sample patterns drawn at the estimated parameter. The 
distribution of estimated parameters for this set should not peak near either end of the training 
range and have a width comparable to the training range. The range is too narrow if the distribution 
does not have a well-defined shape; is flat around the range. In practice convergence typically 
requires only two or three iterations. 


3. Application: Cosmic Microwave Background Maps 

To illustrate this algorithm, we fit the spectral index for CMB anisotropies based on a Gaussian 
model, an example of the type of problem that will need accurate methods with low computational 
cost. Deriving cosmological parameters from maps of the cosmic microwave background usually 
involves maximum likelihood algorithm whose computational cost (N^) makes them prohibitive for 
mega-pixel maps. Our neural network method provides rapid parameter estimation for CMB maps. 

On angular scales 9 > 2°, anisotropy in the cosmic microwave background corresponds to pri¬ 
mordial density perturbations with scale-free power spectrum (x k^, where k is the wavenumber 
and n is the power-law index. This comes about as quantum fluctuations of the metric during 
inflation are pushed outside the causal horizon, becoming classical in the process. Later, after 
inflation has ended, the causal horizon catches up to the fluctuations and they reenter. 

We expand full-sky anisotropy maps in terms of spherical harmonics: 

A 

(P) = ^ aimYimiO, (p). (18) 

l^m 

Since the primordial fluctuations are assumed to have amplitudes that are independent Gaussian 
processes for each wavenumber, maps are readily generated by projecting these processes onto the 
expansion coefficients aim- This amounts to letting the a/^’s be random Gaussian variables with 
zero mean and the variance (Bond and Efstathiou 1987) 

/|„ | 2 \ (5Q^\ r(/ + (n-l)/2)r((9-n)/2) 

W Im\ / Y{1 + (5 - n)/2)r((3 + n)/2)' 


( 19 ) 
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Our maps are parameterized by the spectral index n and the amplitude Q. We now see clearly 
why our maps are stochastic patterns; it is only the statistical properties that are specified by 
our model. Moreover, these properties are not local for it is in the harmonic conjugate space of 
our maps that we draw independent random variables. Our networks will learn these non-local 
statistical correlations as they are trained on different realizations from the above parameterized 
distributions. 


3.1. Instrument modeling 

To be able to estimate the parameters for observed data sets, we need to include instrument 
effects to the pure theory sky maps we generate. The first effect is the pixelization of the full sky. 
We use the pixelization scheme used by the COBE-DMK data, which results in a total of 6144 
pixels, each of size 2.5° x 2.5°. To generate a single realization, to each aim, up at an Imax = 20, 
we assign a Gaussian random number with width given by Eqn. 19 and evaluate Eqn. 18 with the 
angular coordinates centered on each pixel: {9, (f) {9i, 4>i), i = 1,..., 6144. This pixel resolution 

oversamples the resolution of the radiometer horn. To account for the radiometer horn profile, we 
smooth each of these full-sky maps with a Gaussian profile with a EWHM of 7°. 

As true for all measurements, the detectors introduce noise into the signal. The scanning 
strategy for COBE was to visit each pixel many times. Each visit is accompanied by the introduction 
of Gaussian noise to the signal. Since each addition of noise to each pixel is an independent random 
process, the more often a pixel is visited, the more the noise is supressed. We model this by adding 
Gaussian noise to each pixel with the variance (20 mK)^/Aobs,i; where is A^obs,i is the number of 
observations of pixel i (Bennett et. al. 1996). (In truth, the pixels are visited in pairs and differences 
measured; the analysis of the above reference takes this into account.) 

Eoreground emission from our Galaxy dominates the COBE data near the Galactic plane, ren¬ 
dering it unusable for cosmological analyses. We use the galaxy cut template of (Banday et. al. 1997) 
to excise pixels with significant Galactic emission. The cut sky represents an additional challenge 
for standard maximum-likelihood analyses. In the absence of this cut, the data sets represent full- 
sky coverage and can be decomposed in terms of orthogonal spherical harmonics. The resulting 
coefficients yield the power spectrum of the CMB and hence the spectral index n. Once the galaxy 
cut is imposed, the spherical harmonic functions are no longer orthogonal on the remaining pixels. 
Any attempt to to obtain a harmonic expansion will result in the aliasing of power between modes 
and an inaccurate power spectrum. Though a new orthogonal set of basis functions can be com¬ 
puted for the cut sky (Gorski, 1994), this is an problem as well. Since neural networks estimate 
cosmological parameters using the real-space pixel values, they need not take the detour through 
the power spectrum, and do not suffer aliasing of power. We simply impose the cut for galactic 
emission and train each network using only the remaining high-latitude pixels. As the network is 
trained , it automatically learns the effect of cuts in the data, without requiring any symmetries in 
the cut data (see e.g. Oh et. al. 1999). 
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The final step in processing our raw theory maps is to fit and remove the dipole and quadrupole 
moments. Our peculiar motion, due to the our motion in the galaxy and the galaxy’s motion 
through the universe, gives rise to a non-cosmological dipole in the observed CMB anisotropy. The 
quadrupole is removed because it is dominated by local galactic emission and at the same time 
containes very little cosmological information. 


3.2. Training, sampling and results 

We generate simulated COBE maps for fixed Q = 20iJ,K and use the spectral index n as 
the parameter to estimate. Working at the COBE-DMR resolution and after the galaxy cut, this 
leaves an input pattern of = 3881 pixels. We use A^mdden = 600 and A^Train = 12000, along with 
the learning rate r] = 0.0188 and initialize the weights with a zero mean Gaussian of width 0.35 
(determined to be the optimal choices). We globally rescale the input patterns so the variance for 
all patterns in both training sets is unity. We can not do this per training set since the variance of 
a sky map is dependent on the spectral index n. We train 50 networks over the parameter range 

= 0.0 and = 2.0. To build the probability distribution Eqn 15, we use K = 17 parameter 
values ranging from and with N = 1000 sample patterns per sample set. Figure 3 plots 
the probability distribution of Uobs for a set of 1000 samples of Um = 1-40 (dotted line). This 
distribution matches with separation of the training sets and we need this only training iteration. 
We recover hobs = 1-30, with 68% of the samples between 1.07 and 1.51. In terms of our Bayesian 
analysis, for n ~ 1, we determine a = 0.35, in agreement with the traditional maximum likelihood 
analysis of the COBE-DMR 2 year data (Gorski et. al. 1994). 

Note that the neural network algorithm recovered the correct spectral index even though 
none of the networks used were trained at this value. The uncertainty, derived from the width 
of the probability distribution of nobs; is comparable to the value predicted by the maximum 
likelihood method. Neural networks can recover cosmological parameters from CMB data sets with 
comparable precision as maximum likelihood techniques, but using calculations instead of N^. 


4. Scaling of Computational Cost 

Neural nets have many desirable characteristics for parameter estimation with mega-pixel 
CMB maps. They operate globally on the data and return unbiased estimates of the underlying 
parameter values. They automatically account for data gaps, instrument noise, and other features 
peculiar to a particular data set. Most importantly, the computational costs are low enough to 
allow extension to the mega-pixel data sets expected in the near future. 

The dominant contribution to the CPU cost is the array multiplication associated with the 
weights. This multiplication is performed twice per training pass: once for evaluating the pattern, 
then again for back propagating the error. Each operation scales as the number of weights. Our 
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total computational cost for training a network is 

-^CPU = 2AfTrain (-/^Hidden {Nd + 2) + 1) (20) 

How this cost scales with Nd depends on the problem being considered. 

We derive the scaling for CMB maps by analyzing the CPU costs as progressively larger and 
larger areas of the sky are covered. In this way, new information is introduced into the data sets 
as the patch size increases. The S/N ratio per pixel is fixed in this scheme, reflecting the trend in 
current experiments of scanning ever larger portions of the sky at (roughly) constant S/N ratio per 
pixel. 

We select circular patches of sky centered at the north zenith. The range of patch sizes are 
chosen to cover 1.5 orders of magnitude in Nd- The number of training passes and the number of 
hidden units depends on the number of input pixels, i.e., the patch size. For each patch size, we 
determine which ^Hidden gives the best Nqpxj for a fixed training accuracy. For a range of ^Hidden) 
we train multiple networks on large sets of patterns, also varying the learning rate and width of 
the Gaussain used to initialize the weights. As the networks are being trained, we monitor their 
ability to correctly classify independent samples via fc introduced earlier. Once this ability passes 
a pre-set threshold, we know NTrain for each network, and hence Ncpu- We repeat for multiple 
networks to estimate the uncertainty in the CPU cost. For any fixed patch size, we find the optimal 
learning rate strongly depends on ^Hidden, while the results for Ncpu are not strongly dependent 
on the precise value of ^Hidden- Above a minimum value, as ^Hidden increases, the number of 
training passes needed decreases in such a way that A^CPU remains constant over a wide range of 
A^Hidden- The results are shown in Figure 4. The computational cost for training a CMB network 
scales according to A^CPU ~ ^d^^ ^ considerable improvement over the Aj scaling behavior for a 
maximum likelihood analysis. 


5. Discussion 

We have shown neural networks can be used as a tool for astrophysical parameter estima¬ 
tion. For specificity, we have worked in the cosmological context of CMB anisotropy maps where 
the stochastic nature of the problem is fundamental. The results are insensitive to noise levels 
and sampling schemes typical of large astrophysical data sets and provide parameter estimation 
comparable to maximum likelihood techniques. 

If we classify parameter estimation techniques as to whether they are forward or reverse algo¬ 
rithms, we see the real strength of neural networks. Maximum-likelihood methods are an example of 
reverse algorithms. They start with the statistic under consideration and work backwards, inverting 
a covariance matrix, to the likelihood function used to compare different parameter choices. For¬ 
ward algorithms provide a way to avoid the high computational costs of inverse methods. Typically, 
it is much simpler to generate model predictions at each sampled point in parameter space than to 
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compute the matrix inverse and determinant required for maximum likelihood techniques. Forward 
algorithms trade many realizations of synthetic data sets computed at specific parameter values for 
the computationally infeasible matrix inversion. Neural networks are such an algorithm; synthetic 
data sets are used to both train and sample the networks. This gives us our speed improvement. 

Since either maximum likelihood or neural networks can be viewed as the “machinery” for pa¬ 
rameter estimation, the fundamental information flow stays the same (see Figure 1). The statistical 
confidence levels for the fitted parameters are always accessible. When the “machinery” is sampled 
with independent synthetic data, we can determine the probabilities for making correct or incor¬ 
rect parameter identifications. Such sampling also gives us direct access to the statistical power 
(Phillips & Kogut 2001). While training, the information distinguishing the different parameters 
is encoded in the weights. Interperting the resulting weight matrices is not usually possible (as 
compared to the Fisher matrix, Eqn 11). Using independent sampling of the network to derive the 
probability distributions needed for, e.g. Bayesian analysis, means we do not need direct access to 
the information in the weight matrices. 

Neural networks do not require that we specify one single statistic of a priori interest, a lim¬ 
itation of maximum likelihood As the network is trained, it determines how it will discriminate. 
The information required to separate different parameter points comes from the training set sim¬ 
ulations. This lack of a need for a goodness-of-fit function can make neural networks ideal for 
questions that have been traditionally hard to to answer. These include probing the global topol¬ 
ogy of the universe (Lachieze-Rey &: Luminet 1995; Levin et. al. 1998) and exploring the viable 
of non-Gaussian models of structure formation. Both of these problems involve detecting global 
features in the data sets and there is no strong consensus as to the best statistic to use. The often 
used power spectrum fails to capture enough information to decisively test either hypothesis. If 
there is a non-trivial topology to the universe, then the isotropy of the universe assumed when 
using the power spectrum is broken. Neural networks may turn out to be the ideal method for 
detecting the global features that would be present if indeed the universe does have a non-trivial 
topology. Any non-Gaussianity present in the primordial fluctuations that seed structure forma¬ 
tion is beyond the power spectrum’s ability to identify, since it measures the second moment. The 
bispectrum (Heavens 1998; Ferreira et. al. 1998; Phillips & Kogut 2001) (the harmonic conjugate 
of the three point correlation function) is often used to detect evidence of a departure from pure 
Gaussian. Neural networks, when combined with a non-Gaussian theory, will provide a complement 
to bispectrum tests. 

Along with the explicit example we have presented here of Cosmic Microwave Background 
anisotropy data, neural networks as a tool for parameter estimation has application to other types 
of data. Redshift surveys such as the 2-Degree Field and the Sloan Digital Sky Survey measure 
the redshift and position on the sky of a large number of galaxies {N ~ 10®), sampling the quasi- 
linear regime ~ 100/i“^ Mpc where h is the Hubble constant in units 100 km s“^ Mpc“^. The 
observed redshift is the sum of the Hubble flow and the peculiar velocity induced by gravitational 
acceleration in the evolving density field. Coherent flows on large scales produce artifacts in the 
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redshift distribution compared to real space. Galaxies on the far side of an overdensity tend to flow 
toward the center (hence toward the observer) so that their peculiar velocities subtract from the 
Hubble flow, making them appear closer than they really are. Galaxies on the near side move the 
opposite direction, so their peculiar velocities add to the Hubble flow. The net result is an apparent 
enhancement in the galaxy density in redshift space on scales of superclusters, compressing the 
region along the line of sight to the observer. The amplitude of this “bull’s-eye” effect depends 
on the matter density Qm on scales comparable to superclusters of galaxies and can be used to 
determine Qm in model-independent fashion (Praton et. al. 1997; Melott et. al. 1998). 

Estimating 17^ from distortions in redshift space has several problems in practice. The first 
is defining a statistic to quantify the bull’s-eye enhancement in concentric rings about the origin. 
(Melott et. al. 1998) use a large number of simulations to develop an empirical statistic defined as 
the ratio of rms spacing between upcrossings in isodensity contours in the redshift (radial) direction 
to that in the orthogonal (azimuthal) direction. It is thus a local statistic in that it compares high- 
density regions only to other nearby regions, and operates only on a single slice of redshift space 
after smoothing and contouring. 

Neural nets, by contrast, offer a global test by comparing each region of the density field to 
all other regions simultaneously, and can easily be extended across the entire three-dimensional 
survey. No a priori statistic need be identified, nor do neural nets require contouring of the density 
field, thus avoiding the need to “fine-tune’ the selection of contour levels. 

Neural networks offer a promising approach to cosmological parameter estimation, where the 
statistical properties of the primordial matter and energy distribution provide one of the few fal- 
sifiable tests of the standard inflationary paradigm. They do this at a computational cost much 
lower than traditional maximum likelihood methods. In the context of CMB anisotropy maps, this 
cost, 0{N\'^), is better or comparable to the best approximate methods. 
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Fig. 1.— Our basic model for parameter estimation from a data set. We have to a priori assume 
a model to compare the observed data set against; what we pay attention to is the machinery 
for performing this comparison. Maximum likelihood methods, the de facto standard in the CMB 
community must assume a model, just as must be done with currently proposed neural network 
method. 
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Fig. 2.— Sample neural network output distributions, (a) Solid line is the distribution of output 
values for an independent set of samples drawn at the same parameter value for which the 
network was trained to the target 0. The dotted line is for samples drawn at the value for the 
target 1. The vertical line is the midpoint value Omid that maximizes the fraction correct fc, he., 
all output values < Omid are identified as being drawn at p^^^ while the rest at pO), The output 
of the same network, but for samples drawn at parameters intermediate to p^^^ and p^^\ The solid 
line is for p close to p^^^ and the dotted line for a choice close to p^^^. (c) Solid line is the average 
truth value t for the p^^^ patterns, averaged over a committee of 50 networks, the only difference 
between the networks being the initial randomization of the weights. The dotted line is for p^^\ 
(d) The average truth value for the same sets of samples in (b). The averaging has produced two 
well defined peaks that are cleanly separated. Distributions of t like those in (c) and (d) become 
the basis for predicting which estimated parameters to associate with the average truth values, the 
ouput due to presenting a committe of networks with an unknown pattern. 
















- 20 



Fitted spectral index n 


Fig. 3.— Fitted spectral index nobs derived from 1000 realizations of CMB anisotropy sky maps 
with nin = 1.25. The dotted line is for an initial training range of = 0.5 and = 1.5 while 
the solid line is the distribution for the final range of = 0.8 and = 1.4. The fitted values 
correctly peak at the input value (vertical solid line), despite never having trained on this parameter 
value. 
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Fig. 4.— Scaling of computational costs for CMB anisotropy. Working at a fixed sky map resolution, 
we vary the patch size that is examined. This holds the S/N per pixel fixed, but new information 
is introduced as the patch size increases. The solid line represents a power law fit of A^cPU iVi-4 




